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ABSTRACT 

We present a novel technique to overcome the limitations of the applicability of Principal 
Component Analysis to typical real-life data sets, especially astronomical spectra. Our new 
approach addresses the issues of outliers, missing information, large number of dimensions 
and the vast amount of data by combining elements of robust statistics and recursive al- 
gorithms that provide improved eigensystem estimates step-by-step. We develop a generic 
mechanism for deriving reliable eigenspectra without manual data censoring, while utilising 
all the information contained in the observations. We demonstrate the power of the method- 
ology on the attractive collection of the VIMOS VLT Deep Survey spectra that manifest most 
of the challenges today, and highlight the improvements over previous workarounds, as well 
as the scalability of our approach to collections with sizes of the Sloan Digital Sky Survey 
and beyond. 
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1 MOTIVATION 

As modern telescopes collect more and more data in our expo- 
nential world, where the size of the detectors essentially follows 
Moore's law, the kind of statistical challenges astronomers face in 
analysing the observations change dramatically in nature. We need 
algorithms that scale well in time and complexity with the volume 
of the data, while obeying the constraints of today's computers. But 
the large data volume is only one of the consequences of this trend. 
With more observations in hand, the number of problematic detec- 
tions also increases. In addition to the elegant mathematical theo- 
rems that work miracles on textbook examples, scientists need to 
develop methodologies that provide reliable results that are robust 
in the statistical sense when applied to real-life data. 

One particular multi-variate analysis technique, which is 
widely accepted and popular not only in astronomy but also in 
genetics, imaging and many other fields, is Principal Component 
Analysis (PCA). For its simple geometric meaning and straight- 
forward implementation via Singular Value Decomposition (SVD), 
it has been utilised in many a reas of the field including classifi- 
cation of galaxies and quasars llFrancis et alJI 19921 : IConnollv et al.l 



ll995tlConnollv & Szalavlll999i ; lYip et alj|2004allbh. spectroscopic 
and p hotometric redshift estimation dGlazebrook Offer. & Deelev 



1998; Budavarietal.ll 1 999. 2000), sky subtraction (Wild & Hewett 
2005), and highly efficient optical spectral indicators dWild et al 
2007). However, direct application of the classic PCA to real data 
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is almost always impossible; the reasons are usually three-fold: 
(1) The technique is extremely sensitive to outliers. With smaller 
datasets, scientists would "clean up" the sample by completely re- 
moving the "obvious" outliers in one or more projections and anal- 
yse the remaining subset. The problem with this approach is that 
it is subjective, and it becomes impractical for large datasets. (2) 
Another problem is missing measurements in the data vectors, e.g., 
pixels of a spectrum. There are various reasons for this to occur: 
strong night sky lines, cosmic rays or simply because of the redshift 
that yields different restframe wavelength coverage for the spec- 
tra. The application of PCA implies the assumption of a Euclidean 
metric, but it is not clear how to calculate Euclidean distances when 
data is missing from our observed vectors. (3) Last but not least, the 
memory requirement of SVD is significant as the entire dataset is 
stored in memory while the decomposition is computed. For exam- 
ple, the Sloan Digital Sky Survey (SDSS) Data Release 6 contains 
a million spectra with 4000 resolution elements each. While ma- 
chines certainly exist today that have the required amount of RAM 
(~ 50GB), typical workstations have lesser resources. Additionally, 
in most situations, we only seek a small number of eigenvectors as- 
sociated with the largest eigenvalues, and SVD computes all the 
singular vectors in vain. 

The optical spec tra of the VIMOS VLT Deep Survey (VVDS; 
iLe Fevre et alj|2005» provides an extremely attractive dataset for 
galaxy evolution studies at high redshift, yet, due to their generally 
low signal-to-noise ratio they are unsuitable for traditional PCA. 
In this case, the challenge does not lie in the volume of the data 
set, rather in the natural limitations of high redshift observations. 
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Thanks to the careful processing by the VVDS Team, the spectra 
are well calibrated and each one contains valuable information for 
a PCA analysis. 

Our goal is to develop an algorithm to address all of the above 
issues in a way that is true to the spirit of the PCA and maintains 
its geometrically meaningful properties. In ij2]we detail the vari- 
ous layers of our solution to the problem, and in Sj3]we compare 
the performance of different techniques when applied to the VVDS 
spectra. In 2] we evaluate the results of the methods and analyse 
the emerging physical features, ^concludes our study. 



2 STREAMING PCA 

Our approach to the analysis is not the classical one. Instead of 
working with a data set, we aim to formulate the problem using 
the concept of a data stream. We want to incrementally improve 
our understanding of the properties of the data, deriving better and 
better eigenspectra through the incremental addition of new obser- 
vations. 

We develop an algorithm to recursively calculate the quantities 
of interest. As the first step and an illustration of the concept, we 
look at the calculation of the sample mean, 



well as the new observation vector y, 
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where {x„} are the observation vectors. One can define a series 
of estimates as 
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n - 1 1 



(2) 



n n 

It is easy to see that fi 1 = x± and fj, N = /x. This iterative formula 
is the key to our new procedure, where the best estimate of the 
sample mean at each step is 



M = 7M P rov + (1 - 7) a 

= Mprcv + (1 - l)V 



(3) 
(4) 



where we introduced the centered variable y — x — ^ plev - The 
weight parameter 7 lies between and 1, and may be a function of 
both the observation vector and the iteration step. 



2.1 Updating the Eigensystem 

The calculation of the sample covariance matrix is essentially iden- 
tical to that of the mean, except we average the outer products of 
the vectors. Let us solve for the eigenspectra that belong to the 
largest p eigenvalues that account for most of the sample variance. 
This means that the E p A p Ej is a good approximation to the full 
covariance matrix, where {A p , E p } is the truncated eigensystem. 
Hence, the recursion takes the form of 



C = 7C prcv + (1 - i)yy T 
« 7 E p A p eJ + (1 - 7 )yy 3 



(5) 
(6) 

Following iLi et al. I d2003h . we write the covariance matrix as the 
product of some matrix A and its transpose 

,t 
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where the matrix A has only (p + 1) columns, and is thus much 
smaller than the covariance matrix. The columns of A are the con- 
structed from the previous eigenvalues Afe and eigenspectra eu, as 



k = 1. 



(8) 
(9) 



If A = UWV T then the eigensystem of the covariance C will 
have eigenvalues of A = W 2 and eigenspectra equal to the 
singular-vectors, E = U. Therefore, this formalism allows us to 
update the eigensystem by solving the SVD of the much smaller A 
leading to a significant decrease in computational time. 

Following the above procedure, one can update the truncated 
eigensystem step-by-step by adding the observed spectra one-by- 
one to build the final basis. A natural starting point for the iteration 
is to run SVD on a small subset of observation vectors first and 
proceed with the above updates from there. 

2.2 Robustness against Outliers 

Before we turn to making the algorithm robust, to understand the 
limitations of PCA let us first review the geometric problem that 
PCA solves. The classic procedure essentially fits a hyperplane to 
the data, where the eigenspectra define the projection matrix onto 
this plane. If the truncated eigensystem consists of p eigenspectra 
in the matrix E p , the projector is E p E p , and the residual of the fit 
for the nth observation vector is written as 



(I- 



E P Ej)y„ 



(10) 



Using this notation, PCA simply solves the minimization problem 
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where r n = \r n \. The sensitivity of PCA to outliers comes from 
the fact that the sum will be dominated by the extreme values in the 
data set. 

Over the last couple of decades, a number of improvements 
have been proposed to overc ome this issue within the framew ork 
of robust statistics (e.g., see llvlaronna. Martin & Yohaill2006l for 
a concise overview). The current state-of-the-art techniqu e intro- 
duced bv lMaronn3 d2005l) is based on a robust M-estimate l lHuberl 
1 198 lh of the scale, called M-scale. Here we solve the new minimi- 
sation problem 

miner 2 (12) 

where a 2 is an M-scale of the residuals r 2 , which satisfies the equa- 
tion 
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where p is the robust function. Usually a robust p-function is bound 
and assumed to be scaled to values between p(0) = and p(oo) — 1. 
The parameter 5 controls the breakdown point where the estimate 
explodes due to too much outlier contamination. It is straightfor- 
ward to verify that in the non-robust maximum likelihood estima- 
tion (MLE) case with p(t) — t and S — 1, we recover the classic 
minimization problem with a being the root mean square (RMS). 

By implicit differentiation the robust solution yields a very 
intuitive result: the mean is a weighted average of the observation 
vectors, and the hyperplane is derived from the eigensystem of a 
weighted covariance matrix, 



E 



w n x n 



(14) 
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C = a 2 w n (x„-fi)(x n -fi) T J J ( w n rl) (15) 

where w n = Wir^/a 2 ) and W(t) — p'(t). The weight for each 
observation vector depends on a 2 , which suggests the appropriate- 
ness of an iterative solution, where in every step we sol ve for the 
eigens pectra and use them to calculate a new a 2 scale; see lMaronnal 
(2005) for details. One way to obtain the solution of eq.(| 1 3I> is to 
re-write it in the intuitive form of 
\ 

CT2 = MXX r " (16) 

n=l 

where the weights are w* = W*(r 2 /a 2 ) with W*(t) = p(t)/t. 
Although, this is not the solution as the right hand side contains a 2 
itself, it can be shown that its iterative re-evaluation converges to 
the solution. 

We take this approach one step further. By recursively calcu- 
lating the eigenspectra instead of the classic method, we can allow 
for a simultaneous solution for the scale a 2 , as well. The recursion 
equation for the mean is formally almost identical to the classic 
case, and we introduce new equations to propagate the weighted 



covariance matrix and the scale, 

V = Ti/Vcv + (! -7i)as (17) 

C = 72 C prcv + (1 - 72) o 2 yy T /r 2 (18) 

° 2 = T^prcv + (1 - 73) »V 2 /5 (19) 

where the 7 coefficients depend on the running sums of 1, id and 
wr 2 denoted below by u, v and q, respectively. 

71 = av plcv /v with v = av plev +w (20) 

72 = a%rcv/<? with q = aq prcv + wr 2 (21) 

73 = atiprcv/u with u — au pTav + 1 (22) 



The parameter a introduced here, which takes values between and 
1, adjusts the rate at which the evolving solution of the eigenprob- 
lem forgets about past observations. It sets the characteristic width 
of the sliding window over the stream of data; in other words, the 
effective sample sizeQ The value a — 1 corresponds to the classic 
case of infinite memory. Since our iteration starts from a non-robust 
set of eigenspectra, a procedure with a < 1 is able to eliminate the 
effect of the initial transients. Due to the finite memory of the re- 
cursion, it is clearly disadvantagous to put the spectra on the stream 
in a systematic order; instead they should be randomized for best 
results. 

It is worth noting that robust "eigenvalues" can be computed 
for any eigenspectra in a consistent way, which enables a mean- 
ingful comparison of the performance of various bases. To derive 
a robust measure of the scatter of the data along a given eigen- 
spectrum e, one can project the data on it, and formally solve the 
same equation as in eq.(|13t but with the residuals replaced with the 
projected values, i.e., for the fcth eigenspectrum r n = e&y n . The 
resulting a 2 value is a robust estimate of 

2.3 Missing Entries in Observations 

The other common challenge is the presence of gaps in the obser- 
vations, i.e., missing entries in the data vectors. Gaps emerge for 
numerous reasons in real-life measurements. Some cause the loss 
of random snippets while others correlate with physical properties 

1 For example, the sequence u rapidly converges to 1/(1 — q). 



of the sources. An example of the latter is the wavelength coverage 
of objects at different redshifts: the observed wavelength range be- 
ing fixed, the detector looks at different parts of the electromagnetic 
spectrum for different extragalactic objects. 

Now we face two separate problems. Using PCA implies that 
we believe the Euclidean metric to be a sensible choice for our 
data, i.e., it is a good measure of similarity. Often one needs to 
normalize the observations so that this assumption would hold. For 
example, if one spectrum is identical to another but the source is 
brighter and/or closer, their distance would be large. The normal- 
isation guarantees that they are close in the Euclidean metric. So 
firstly, we must normalise every spectrum before it is entered into 
the streaming algorithm. This step is difficult to do in the presence 
of incomplete dat a, hence we also have to "pa tc h" the missing data. 

Inspired by lEverson & Sirov ich ( 1995), Connollv & Szalav 

dl999h proposed a solution, where the gaps are filled by an unbiased 
reconstruction using a pre-calculated eigenbasis. A final eigenbasis 
may be calculated iteratively by continuously filling the ga ps with 
the pr evious eigenbasis until convergen ce is reached iY'w et all 
l2004ah . While lConnollv & Szalavl 1 19991) allowed for a bias in rota- 
tion only, the method has recently been extended to accommodate 
a shift in the normalisation of the data vectors dWild et all 1 20071 . 
Lemson et al., in preparation). Of course, the new algorithm pre- 
sented in this paper can use the previous eigenbasis to fill gaps in 
each input data vector as they are input, thus avoiding the need for 
multiple iterations through the entire dataset. 

The other problem is a consequence of the above solution. 
Having patched the incomplete data by the current best understand- 
ing of the manifold, we have artificially removed the residuals in the 
bins of the missing entries, thus decreased the length of the resid- 
ual vector. This would result in increasingly large weights being 
assigned to spectra with the largest number of empty pixels. One 
solution is to calculate the residual vector using higher-order eigen- 
spectra. The idea is to solve for not only the first p eigenspectra but 
a larger (p+q) number of components and estimate the residuals in 
the missing bins using the difference of the reconstructions on the 
two different truncated bases. 



3 THE VVDS SPECTRA 

The VIMOS VLT Deep survey (VVDS) is a deep spectroscopic 
redshift survey, targetting obj ects with apparent m agnitudes in the 
range of 17.5 < Iab «S 24 jLe Fevre et ai]|2005h . The survey is 
unique for high redshift galaxy surveys in having applied no fur- 
ther colour cuts to minimise contamination from stars, yielding a 
particularly simple selection function, making it a very attractive 
dataset for statistical studies of the high redshift galaxy population. 
In this work we make use of the spectra from the publicly available 
first epoch data relea se of the VVDS-0226-04 (VVDS-02h) field 
jLe Fevre et al.ll2005T) . The spectra have a resolution of R = 227 
and a dispersion of 7. 14A/pixel. They have a usable observed frame 
wavelength range, for our purposes, of ~5500-8500A. 

The first epoch public data release contains 8981 spectroscop- 
ically observed objects in the VVDS-02h field, we select only those 
with moderate to secure redshifts (flags 2, 3, 4, and 9) that lie in the 
redshift range 0.5 < z < 1.0. The redshift range is determined by 
the rest-frame spectral range we have chosen for this study. The 
final sample contains 3485 spectra. 

The VVDS dataset provides the ideal test for a robust PCA al- 
gorithm, because of the low signal-to-noise ratio of the spectra and 
the significant chance that outliers exist due to incorrect redshift 
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Figure 1. The mean spectrum and top four eigenspectra for the VVDS galaxies. The eigenspectra have been inverted where necessary to make the physical 
features easier to identify (i.e. absorption lines in absorption and emission lines in emission) Left: The result from classic PCA on 3485 spectra. Center: 
The result from classic PCA with iterative removal of outliers. The final dataset contains 2675 spectra. Right: The result from the new iterative-robust PCA 
algorithm. 



determinations. We have chosen the 4000 A break region to illus- 
trate the procedure because of the obvious impor tance of this spec- 
tral region for g alaxy evolution studies (e.g., Balog h~et alJI 19991 ; 
IWild et alH2007l) and also due to the wide variety of spectral fea- 
tures present for the PCA to identi fy. Eigenspectra si milar to those 
created in this analysis are used bv lWild et a l. (2008) for the iden- 
tification of H<5-strong galaxies in the VVDS survey. 



3.1 Classic and Trimmed Analysis 

To provide a comparison for the robust algorithm, we first per- 
form the classic PCA using an S VD algorithm. The spectra are cor- 
rected for Galactic extin ction assuming a uniform E(B— V)= 0.027 
dMcCracken et al.l2003l) . moved to the galaxy rest-frame and inter- 
polated onto a common wavelength grid. Regions of the spectra 
with bad pixels are identified using the associated error arrays, re- 
gions with strong night sky lines are included into the mask. Each 
spectrum is normalised, by dividing by the median flux in the good 
pixels, and gaps in the dataset caused by bad pixels are filled with 



the median of all other spectra at that wavelength. The mean spec- 
trum is calculated and subtracted, and PCA is then performed on 
the residuals. The resulting mean spectrum and eigenspectra are 
presented in the first column of FigureQ] 

Clearly the noise level of the eigenspectra resulting from the 
classic PCA is high. The distribution of principal component am- 
plitudes reveals that the signal in many of the eigenspectra is dom- 
inated by a small number of outliers. A natural way to improve on 
this situation is through the iterative removal of these outliers based 
on the principal component amplitude distributions. This procedure 
is essentially the same as the calculation of truncated statistics, e.g., 
the trimmed mean, when one excludes some percentage of the ob- 
jects symmetrically based on their rank. For the dataset in question, 
20 iterations are required to reduce the number of 3a outliers in the 
top 10 eigenspectra to half a dozen, resulting in a total number of 
2675 spectra for the final PCA. For a more thorough analysis, a 
convergence criteria can be employed to indicate when the eigen- 
spectra cease to vary significantly (e.g. jYip et alj|2004ah . 

The resulting mean and eigenspectra from this trimmed PCA 
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Figure 2. Left: The top six normalized eigenvalues as a function of iteration number using the classic PCA. Each eigenvalue represents the amount of 
sample variance carried by the eigenspectrum. Right: The top six normalized eigenvalues as a function of galaxy number for robust streaming PCA shown in 
various colours. The x-axis begins at 200, the size of the dataset used to initialise the eigenbasis. The dashed and dotted lines are runs from different random 
initialisations that converge to the same results. 



are shown in the central column of Figure [T] As well as the eigen- 
spectra being visibly less noisy, the PCA now identifies more phys- 
ical features, linking together in a single eigenspectra those features 
we know to be correlated, e.g., the Balmer Hydrogen line series in 
the second eigenspectrum, and emission lines in the third eigen- 
spectrum. The left hand panel of Figure [2] shows the convergence 
with iteration number of the top six eigenvalues, which represent 
the variance in the dataset described by each corresponding eigen- 
spectrum. The first eigenspectra converges quickly, after only a few 
iterations. The later eigenspectra converge more slowly. 

While this iterative procedure results in a clean set of eigen- 
spectra, the removal of outliers based on single components can 
easily lead to the loss of information from the dataset. This occurs 
when more unusual spectra, which would appear in later eigen- 
spectra, are thrown out as outliers in the top few components. Ad- 
ditionally, running a full PCA for multiple iterations is undeniably 
an inefficient use of computing power, especially for large samples 
like the SDSS, where a single iteration takes about 2 days. We will 
now describe the application of the iterative and robust PCA algo- 
rithm to the same noisy dataset, showing that the same noise free 
and physically interesting eigenspectra are recovered, more quickly 
and without the physical removal of spectra from the dataset. 



size and content of the initialisation dataset and the precise method 
used to initialise a. Changes to a and <5 alter the speed of con- 
vergence and susceptibility to outliers as the algorithm proceeds in 
time. Starting from the 201st spectrum, we input the spectra into the 
robust streaming PCA algorithm. The right hand panel of Figure [2] 
shows the progression of the eigenvalues with spectrum number in 
separate colours from three alternative random initialisation shown 
in solid, dotted and dashed lines. We see that full convergence of 
the top three eigenspectra is reached in less than one round of the 
3485 spectra; naturally eigenspectra which carry less of the sample 
variance converge more slowly as they depend on the lower order 
components, but in this example they still stabilise in less than two 
rounds of iterations. 

The third column of Figure [T] shows the resulting mean and 
top four eigenspectra. In this test case, the eigenspectra are very 
similar to those from the trimmed PCA but minor improvements are 
apparent. It is worth noting that the PCA algorithm is completely 
independent of the order of the bins: it has no spatial coherence. 
Hence the fact that our eigenspectra are smoother than the trimmed 
basis is already an indication of them being more robust. 



3.2 Robust Eigenspectra 

Next we apply the streaming PCA method introduced earlier. For 
the actual implementation, we utilise a Cauchy-type p-function: 

p(t) = ~axctaii(|l) (23) 

and use the scalar c to adjust the asymptotic value of the scale es- 
timate to match the standard deviation of a Gaussian point process. 
First we perform a classic PCA on 200 randomly selected spectra 
to provide the initial eigenbasis, and the initial a 2 estimate (eq|19t 
is calculated from the sum of the residuals between these 200 spec- 
tra and their PCA reconstructions. We set a — 1 — 1/N where N 
is the total number of galaxies in the dataset.We also set 8 = 0.5 
to maximize the breakdown point, which yields c ~ 0.787 for our 
choice of p-function. Our final results are robust to variations in the 



4 DISCUSSION 

There are two important points that an effective spectral eigenbasis 
must obey: (1) the eigenspectra should not introduce noise into the 
decomposition of individual galaxy spectra by being noisy them- 
selves; (2) the top few eigenspectra must primarily describe the 
variance in the majority of the dataset, and not be influenced by 
minority outliers. Additionally, the eigenbasis should be quick to 
calculate and without the need for excessive memory storage. 

Figure[T]illustrates the success of robust statistics for address- 
ing point (1). The second point becomes clear when we investi- 
gate the distribution of the principal component amplitudes of the 
3485 VVDS spectra. In Figure [3] we present the first two princi- 
pal component amplitudes for the VVDS spectra using each of the 
eigenbases. The overall correlation between these first two princi- 
pal component amplitudes for the classic PCA indicates the failure 




Figure 3. The joint distribution of the first two principal component amplitudes for the VVDS collection of 3485 spectra using the eigenbases presented in 
Figure [T] In order to focus on the main sample of objects, the axes are scaled such that outliers are not shown. From left to right: classic PCA using SVD, 
trimmed PCA with iterative removal of outliers, our robust PCA from the randomised streaming algorithm. 



of this eigenbasis to represent the variance in the majority of the 
galaxy spectra: the basis has been influenced by outliers. 

A final, important aspect of the new algorithm is the increase 
in speed. The iterative truncation approach to classic PCA is clearly 
inefficient, although the precise increase in speed will vary depend- 
ing on dataset properties. For our VVDS test case the 20 iterations 
of classic PCA take five times longer than a single iteration of the 
3485 spectra using the robust algorithm. The ratio will naturally 
change in favour of the new technique for larger collections of spec- 
tra. 

Our robust analysis is based on a randomised algorithm, and, 
as such, it assumes that the input data entries are considered in ran- 
dom order both for the initial set and the stream. When this is not 
the case, the method may develop a wrong initial representation of 
the data, which can take many iterations to correct. 

The sensitivity of the algorithm to this issue is primarily deter- 
mined by the parameter p, i.e., the number of principal components 
kept between steps. We expect problems only when this value is 
too low compared to the weights assigned to the new input vectors. 
In general, when studying an unknown dataset, we recommend that 
one randomises the dataset at each iteration and solves for as many 
eigenspectra as possible. 

Having said that, we should also note that special ordering 
during the streaming of the data might prove invaluable for study- 
ing the evolution of the eigensystem as a function of the parameter 
used for sorting the input data. However, such studies should take 
extreme care in choosing the adjustable parameters (e.g., effective 
sample size) and ensure that the observed trends are real and stable 
to initial conditions. 



5 SUMMARY 

We present a novel method for performing PCA on real-life noisy 
and incomplete data. Our analysis is statistically robust, and im- 
plements the current state-of-the-art theoretical approach to gener- 
alising the classic analysis. Our streaming technique improves the 
eigensystem step-by-step when new observations are considered, 



and allows for direct monitoring of the improvement. The conver- 
gence is controlled by a single parameter that sets the effective sam- 
ple size. The relevance of this parameter becomes obvious for very 
large datasets such as the SDSS catalog. These large samples are 
very much redundant in the statistical sense, i.e., often the analysis 
of a smaller subset yields as good results. Our method provides di- 
agnostic tools to ensure convergence while enabling the selection 
of smaller effective sample sizes. The resulting eigenbasis has less 
noise than a classic PCA, and represents the variance in the major- 
ity of the data set without being influenced by outliers. Compared 
to a common work-around for reducing the effect of outliers on 
the eigenbasis by excluding extreme instances analoguosly to the 
trimmed mean calculation, the new algorithm provides a noticable 
improvement in robustness and a substantial increase in speed. A 
production implementatio n within the NVC0 Spectrum Service^ 
(Dobos&Budavari 2008) will be released where users of the site 
and Web services can direct the result sets of queries to the robust 
PCA engine. On this site we will publish the IDL scripts used for 
illustrations in this paper. 
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2 Visit the US National Virtual Observatory site at http://us-vo.org 
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